1 Background

In this report, we will focus on studying the results of the mis-splicing noise analysis of Ataxia RNAseq samples: 46 Cerebellum and 45 Frontal Cortex samples. Samples will be pseudobulked by tissue, status disease and ataxia diagnosis, splitting the analysis in three different levels, including the results of the splicing noise between specific Ataxia diagnosis (i.e. SCA1, FRDA…) and control samples.

Changelog

v2.2

  • Added section “Additional studies” with a study about the influence of the number of samples to the results of the Wilcoxon paired-signed rank tests.

v2.1

  • Changed the reference transcriptome to Gencode v38.

v2.0

  • Now, the clusters always compare the same number of samples between cases and controls. Procedure explained in methodology.

  • Added results for Level 2 (AtaxiaSubtype) and references.

  • Changed the reference transcriptome from Gencode v43 to Gencode v39.

  • Updated methodology.

v1.1

  • Updated subsampling method to consider other covariates other than RIN. Each covariate was weighted to contribute according to its influence to the number of novel junctions for each sample. For both tissues, RIN was the most relevant contributor with more than 85% of the weight. Thus, no relevant variation from the previous approach was observed.

2 Methods

2.1 Analysis pipeline (splicing noise evaluation)

For every available BAM file to study, we apply the following steps:

  1. Download and extraction of BAM files: the files are downloaded from s3://ataxia-bulk-rnaseq/nextflow_first_attemp/Star_2_pass_by_indv/output/STAR/align/BAM_files/ and subfolders.

  2. Junction extraction: all junctions are extracted using regtools junction extract after sorting and indexing with samtools. A file is created for each BAM file in BED12 format.

  3. Junction annotation: the junctions are read from the previously created files and merged into a single dataframe of read junctions. We also register the number of reads of each junction in every sample. The junctions located within the ENCODE blacklisted regions v2 are removed. The junction_annot() function from the package dasper is used to annotate the junctions to the Gencode v38 reference transcriptome. All junctions not classified as either novel_donor, novel_acceptor or annotated are ignored. We also remove all junctions smaller than 25bp (base pairs) and annotated introns that are ambiguously assigned to more than one gene.

  4. Junction pairing: by looking for overlaps between the novel junctions and the annotated junctions for each sample, we measure the distance in bp between the novel and reference splice site. The annotated introns that are never associated to a novel junction are considered a never misspliced junctions.

  5. Filtering the distances: we remove the pairings in which a novel junctions are associated to more than one reference intron across different samples. For more information about this process, please see the methods section in Introverse paper [1].

Next, we need to decide on a clustering method to combine and compare different samples. More information in section about clustering.

  1. Measuring the mis-splicing ratio: by adding all novel junction read counts attached to an annotated intron across all samples in which the novel splice was observed, and then dividing by the total number of reads of the annotated intron and the novel junctions across the same set of samples, we obtain a measurement of the mis-splicing ratio for an given annotated intron at both the donor splice site and the acceptor splice site. For more information about the mis-splicing ratio, please see section MSR.

  2. Generation of the DB: two tables are created per each cluster: db_introns and db_novel. Each one contains the relevant information related to reference introns (including the never misspliced introns) and novel junctions. This includes the MaxEntScan scores, the percentage of protein-coding transcripts and the classification in u2 and u12 introns.

2.2 Clustering (pseudobulk)

In our dataset, we have a total of 95 samples, corresponding to 48 different individuals. A total of 4 samples are removed because they belong to individuals diagnosed as CANVAS and AIFM1.

Three different level of studies were studied in this report, and always different analyses for each tissue.

  • Level 1 (Type): whether the sample is diagnosed with ataxia or not (diases status). For the frontal cortex tissue, samples with \(RIN<4\) will be removed, while for the Cerebellum tissue only controls with \(RIN<=7\) are kept. This is to ensure non-significant differences in the RIN medians.

  • Level 2 (AtaxiaSubtype): two different analyses are performed: known ataxia cases vs. controls and unknown ataxia cases vs. controls. In both scenarios, no restrictions about RIN is required.

  • Level 3 (Diagnosis): a different analysis was performed for every ataxia diagnosis with at least three samples: FRDA, SCA1, SCA2 and SCA6. In all situations, control samples are selected to minimize a weighted Gower distance to the case samples (more information in following section).

2.3 Subsampling

Studies about the relationship in mis-splicing ratio’s median and the number of samples pseudobulked (reference) showed a clear correlation between the two. In order to avoid this effect in the comparisons between cases and controls, we decided to subsample the majority class until both classes have the same number of samples. The subsampling was performed so that the weighted Gower distance between the case samples and the control samples was minimized.

For datasets with both quantitative (i.e. RIN) and categorical (i.e. Brain bank) variables, also called mixed datasets, Gower’s distance is a common measurement of similarity between any two samples [2]. The similarity between samples can be defined as:

\[ S_{ij}=\frac{\sum_{k=1}^p s_{ijk}\delta_{ijk}w_k}{\sum_{k=1}^p \delta_{ijk}w_k} \] where \(p\) are the variables beeing compared, \(i\) and \(j\) refers to two different samples and:

  • Quantitative variables: \(s_{ijk}=1-|x_{ik}-x_{jk}|/R_k\) where \(R_k\) is the range of the variable \(k\).
  • Categorical variables: \(s_{ijk}=1\) if \(x_{ik}=x_{jk}\) and \(s_{ijk}=0\) otherwise.
  • \(\delta_{ijk}\): whether a comparison between sample \(i\) and sample \(j\) can be performed for variable \(k\).
  • \(w_k\): optional weight to increase or decrease the relevance of certain variables.

We decided to apply a weighting to the variables in order to increase the relevance of the main contributors to the number of junction reads across samples.

The subsampling process was the following:

  1. Variable weights: first, we divided the samples by tissue. Then, for each sample, we extract the number of reads associated to both annotated and novel junctions. Using a linear model to predict this number of reads, we measured the variance explained by each of the main covariates that will be employed in subsampling: RIN, PMI, Age at death, Brain bank and sex. The percentage of variance explained by each covariate will be considered as the weight in the next step.

  2. Gower’s distance between samples: for each minority class sample, the most similar majority class sample was selected without repetition (i.e. once two samples were assigned together, none of them can be selected again). Thus, the same number of samples between the two classes were obtained.

  3. Wilcoxon test: between the two sets of samples, a Wilcoxon test was executed to test whether there are significant differences in the RIN medians. If significant differences are found, some additional restrictions are applied to any of the sets. For example, for Cerebellum Level 1 study, control samples with RIN higher than 7 needed to be removed to ensure non-significant differences in the median.

All studies will be presented with the samples employed for both classes and Wilcoxon test results. The obtained weights are the following:

Covariate Cerebellum weights Frontal Cortex weights
RIN 0.989 0.855
PMI 0.001 0.132
Brain Bank 0.003 0.013
Age at death 0.001 0.000
Sex 0.006 0.000

2.4 Common annotated introns

For each study, we will only consider the common annotated introns between the two classes. To do so, we generate the following dataframes:

  • Common annotated intron table: we looped through both db_introns tables and extracted only the information from the common annotated introns in the clusters. To identify common annotated introns, we used their locus (i.e. seqname:start-end:strand), since it is a unique identifier. The goal is to have the same number of annotated between cases and controls.

  • Common novel junction table: we looped through both db_novel tables and extracted only the information from the novel junctions associated to common annotated introns. Thus, we first needed to calculate the common annotated intron table.

2.5 Studied metrics - Wilcoxon paired signed-rank test

The examples in this section corresponds to the Level 1 study for Cerebellum.

For each annotated intron, two Mis-Splicing Ratios (\(MSR\)) are calculated to provide a measurement of the mis-splicing frequency at the donor site (\(MSR_D\)) and acceptor site (\(MSR_A\)). We first sum all of the novel donor/acceptor junction read counts and then divide by the sum of all annotated intron and novel junction read counts across the specific samples. It follows this formula:

\[ \begin{equation} MSR_A = \frac{\sum_{i=1}^{N}j_i}{\sum_{i=1}^{N}j_i+\sum_{i=1}^Ns_i } \end{equation} \]

where \(j\) is the number of novel acceptor junction reads for a particular annotated intron, \(s\) is the number of annotated intron reads and \(N\) is the number of samples being studied. We can generate an \(MSR\) table in which each row corresponds to an annotated intron and each column to a cluster. Example of the generated \(MSR_A\) tables:

Once we have calculated the \(MSR_A\) and \(MSR_D\) for every annotated intron and cluster, we use the paired Wilcoxon signed rank test to study if there is a significant variation in the median \(MSR\) in cases vs. controls. To be more precise:

  • Null hypothesis (H0): the observations \(MSR_{case} - MSR_{control}\) are symmetric about \(\mu\) = 0.
  • Alternative hypothesis (H1): the observations \(MSR_{case} - MSR_{control}\) are not symmetric about \(\mu\) = 0 (i.e. the distribution of \(MSR_{case} - MSR_{control}\) is different than 0).

We obtain two p-values (one for each splice site) to reject or not the null hypothesis in favour of the alternative hypothesis. If the p-value is small enough to reject the null hypothesis, we then evaluate the effect size of the difference in medians. Even though effect sizes are complex to interpret, commonly accepted values are: 0.10-0.3 (small effect), 0.3-0.5 (moderate effect) and >=0.5 (large effect).

3 Results for Level 1 (Type)

The studies for cases vs controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:

In general, even if the difference in median is significant across clusters for both tissues and splice sites, the effect sizes are too small to be considered relevant. Results are also shown in the following table:

Tissue Splice site p-value Effect size Magnitude
Cerebellum Donor 0.000 0.048 small
Cerebellum Acceptor 0.000 0.051 small
Frontal Donor 0.819 0.000 small
Frontal Acceptor 0.000 0.022 small

Cerebellum

Distribution of sample RIN for Cerebellum level 1 study:

Control samples with \(RIN > 7\) were removed.

Frontal Cortex

Distribution of sample RIN for Frontal Cortex level 1 study:

All samples with \(RIN <= 4\) were removed.

4 Results for Level 2 (AtaxiaSubtype)

The studies for ataxia subtype vs. controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:

Results are also shown in the following table:

Subtype Tissue Splice site p-value Effect size Magnitude
KnownAtaxia
KnownAtaxia Cerebellum Donor 0.001 -0.008 small
KnownAtaxia Cerebellum Acceptor 0.000 -0.026 small
KnownAtaxia Frontal Donor 0.000 0.008 small
KnownAtaxia Frontal Acceptor 0.000 0.021 small
UnknownAtaxia
UnknownAtaxia Cerebellum Donor 0.000 0.096 small
UnknownAtaxia Cerebellum Acceptor 0.000 0.134 small
UnknownAtaxia Frontal Donor 0.000 -0.036 small
UnknownAtaxia Frontal Acceptor 0.000 -0.015 small

Cerebellum

Distribution of sample RIN for Cerebellum level 2 study:

Frontal Cortex

Distribution of sample RIN for Frontal Cortex level 2 study:

5 Results for Level 3 (Diagnosis)

The studies for diagnosis vs. controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:

Results are also shown in the following table:

Diagnosis Tissue Splice site p-value Effect size Magnitude
FRDA
FRDA Cerebellum Donor 0.000 0.035 small
FRDA Cerebellum Acceptor 0.000 0.039 small
FRDA Frontal Donor 0.000 -0.025 small
FRDA Frontal Acceptor 0.000 -0.028 small
SCA1
SCA1 Cerebellum Donor 0.020 0.006 small
SCA1 Cerebellum Acceptor 0.000 -0.015 small
SCA1 Frontal Donor 0.000 -0.011 small
SCA1 Frontal Acceptor 0.000 -0.016 small
SCA2
SCA2 Cerebellum Donor 0.011 0.006 small
SCA2 Cerebellum Acceptor 0.000 0.024 small
SCA2 Frontal Donor 0.000 -0.051 small
SCA2 Frontal Acceptor 0.000 -0.039 small
SCA6
SCA6 Cerebellum Donor 0.000 -0.039 small
SCA6 Cerebellum Acceptor 0.000 -0.033 small
SCA6 Frontal Donor 0.000 0.034 small
SCA6 Frontal Acceptor 0.000 0.052 small

Cerebellum

Distribution of sample RIN for Cerebellum level 3 study:

Frontal Cortex

Distribution of sample RIN for Frontal Cortex level 3 study:

6 Additional studies

6.1 Effects of number of samples in Wilcoxon tests

Previous results (not shown in this report) showed us that there seems to be a relationship between the number of samples being clustered and the median MSR values. This result is to be expected if we analyse the situation. We have already proven in the past that the number of samples increase the number of novel junctions, which increases the likelihood of a mis-splicing event to be found for any particular intron. As such, it is expected that more reference introns have a different than zero MSR value, which would lead to an increase in the median MSR.

The reduction in the proportion of MSR = 0 reference introns was observed for both tissues and splice sites in control samples:

To expand on this idea, we tested a Wilcoxon paired signed-rank test between different control samples. We will call cluster A as the one constructed from three control samples: the highest RIN, the lowest RIN and the median RIN. Then, different clusters will be generated based on different number of samples per each sample in cluster A. Let’s call cluster B1 to the one generated with three samples in total, selected to to minimize the weighed Gower’s dissimilarity index to each of the case samples of cluster A. Cluster B2 then would be created with two samples per sample in cluster A (i.e. 6 samples in total). These will be called clusters B.

Once we have clusters A and B, we execute a Wilcoxon test to compare the median MSR between both clusters and splice sites. Positive results would mean a higher MSR median in cluster A, while negative results represents the opposite. Results are shown in the following figure:

As expected, the effect size decreases as we increase the samples in cluster B. Thus, in the studies in which the number of control samples is bigger than the number off case samples, it is expected that the Wilcoxon tests report a decrease in the median MSR values between case and control samples.

7 Session info

Show/hide
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.1 (2022-06-23)
##  os       Ubuntu 20.04.4 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Etc/UTC
##  date     2023-06-05
##  pandoc   2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
##  package     * version    date (UTC) lib source
##  abind         1.4-5      2016-07-21 [1] RSPM (R 4.2.0)
##  backports     1.4.1      2021-12-13 [1] RSPM (R 4.2.0)
##  bit           4.0.5      2022-11-15 [1] RSPM (R 4.2.0)
##  bit64         4.0.5      2020-08-30 [1] RSPM (R 4.2.0)
##  bookdown      0.33       2023-03-06 [1] RSPM (R 4.2.0)
##  broom         1.0.4      2023-03-11 [1] RSPM (R 4.2.0)
##  bslib         0.4.2      2022-12-16 [1] RSPM (R 4.2.0)
##  cachem        1.0.8      2023-05-01 [1] RSPM (R 4.2.0)
##  car           3.1-2      2023-03-30 [1] RSPM (R 4.2.0)
##  carData       3.0-5      2022-01-06 [1] RSPM (R 4.2.0)
##  cli           3.6.1      2023-03-23 [1] RSPM (R 4.2.0)
##  codetools     0.2-19     2023-02-01 [1] RSPM (R 4.2.0)
##  colorspace    2.1-0      2023-01-23 [1] RSPM (R 4.2.0)
##  crayon        1.5.2      2022-09-29 [1] RSPM (R 4.2.0)
##  DBI           1.1.3      2022-06-18 [1] RSPM (R 4.2.0)
##  digest        0.6.31     2022-12-11 [1] RSPM (R 4.2.0)
##  doParallel  * 1.0.17     2022-02-07 [1] RSPM (R 4.2.0)
##  dplyr       * 1.1.2      2023-04-20 [1] RSPM (R 4.2.0)
##  evaluate      0.21       2023-05-05 [1] RSPM (R 4.2.0)
##  fansi         1.0.4      2023-01-22 [1] RSPM (R 4.2.0)
##  farver        2.1.1      2022-07-06 [1] RSPM (R 4.2.0)
##  fastmap       1.1.1      2023-02-24 [1] RSPM (R 4.2.0)
##  forcats     * 1.0.0      2023-01-29 [1] RSPM (R 4.2.0)
##  foreach     * 1.5.2      2022-02-02 [1] RSPM (R 4.2.0)
##  generics      0.1.3      2022-07-05 [1] RSPM (R 4.2.0)
##  ggforce       0.4.1      2022-10-04 [1] RSPM (R 4.2.0)
##  ggplot2     * 3.4.2      2023-04-03 [1] RSPM (R 4.2.0)
##  ggpubr        0.6.0      2023-02-10 [1] RSPM (R 4.2.0)
##  ggsignif      0.6.4      2022-10-13 [1] RSPM (R 4.2.0)
##  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
##  gridExtra     2.3        2017-09-09 [1] RSPM (R 4.2.0)
##  gtable        0.3.3      2023-03-21 [1] RSPM (R 4.2.0)
##  here        * 1.0.1      2020-12-13 [1] RSPM (R 4.2.0)
##  highr         0.10       2022-12-22 [1] RSPM (R 4.2.0)
##  hms           1.1.3      2023-03-21 [1] RSPM (R 4.2.0)
##  htmltools     0.5.5      2023-03-23 [1] RSPM (R 4.2.0)
##  httr          1.4.5      2023-02-24 [1] RSPM (R 4.2.0)
##  iterators   * 1.0.14     2022-02-05 [1] RSPM (R 4.2.0)
##  jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.2.0)
##  jsonlite      1.8.4      2022-12-06 [1] RSPM (R 4.2.0)
##  kableExtra    1.3.4.9000 2023-01-30 [1] Github (haozhu233/kableExtra@292f607)
##  knitr         1.43       2023-05-25 [1] RSPM (R 4.2.0)
##  labeling      0.4.2      2020-10-20 [1] RSPM (R 4.2.0)
##  lattice       0.21-8     2023-04-05 [1] RSPM (R 4.2.0)
##  lifecycle     1.0.3      2022-10-07 [1] RSPM (R 4.2.0)
##  lpSolve       5.6.18     2023-02-01 [1] RSPM (R 4.2.0)
##  lubridate   * 1.9.2      2023-02-10 [1] RSPM (R 4.2.0)
##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
##  MASS          7.3-59     2023-04-21 [1] RSPM (R 4.2.0)
##  Matrix        1.5-4      2023-04-04 [1] RSPM (R 4.2.0)
##  mitools       2.4        2019-04-26 [1] RSPM (R 4.2.0)
##  munsell       0.5.0      2018-06-12 [1] RSPM (R 4.2.0)
##  pillar        1.9.0      2023-03-22 [1] RSPM (R 4.2.0)
##  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
##  polyclip      1.10-4     2022-10-20 [1] RSPM (R 4.2.0)
##  proxy         0.4-27     2022-06-09 [1] RSPM (R 4.2.0)
##  purrr       * 1.0.1      2023-01-10 [1] RSPM (R 4.2.0)
##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
##  Rcpp          1.0.10     2023-01-22 [1] RSPM (R 4.2.0)
##  readr       * 2.1.4      2023-02-10 [1] RSPM (R 4.2.0)
##  rlang         1.1.1      2023-04-28 [1] RSPM (R 4.2.0)
##  rmarkdown     2.21       2023-03-26 [1] RSPM (R 4.2.0)
##  rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.2.0)
##  rstatix       0.7.2      2023-02-01 [1] RSPM (R 4.2.0)
##  rstudioapi    0.14       2022-08-22 [1] RSPM (R 4.2.0)
##  rvest         1.0.3      2022-08-19 [1] RSPM (R 4.2.0)
##  sass          0.4.6      2023-05-03 [1] RSPM (R 4.2.0)
##  scales        1.2.1      2022-08-20 [1] RSPM (R 4.2.0)
##  sciRmdTheme   0.3        2023-03-10 [1] local
##  sessioninfo * 1.2.2      2021-12-06 [1] RSPM (R 4.2.0)
##  StatMatch     1.4.1      2022-03-01 [1] RSPM (R 4.2.0)
##  stringi       1.7.12     2023-01-11 [1] RSPM (R 4.2.0)
##  stringr     * 1.5.0      2022-12-02 [1] RSPM (R 4.2.0)
##  survey        4.2-1      2023-05-03 [1] RSPM (R 4.2.0)
##  survival      3.5-5      2023-03-12 [1] RSPM (R 4.2.0)
##  svglite       2.1.1      2023-01-10 [1] RSPM (R 4.2.0)
##  systemfonts   1.0.4      2022-02-11 [1] RSPM (R 4.2.0)
##  tibble      * 3.2.1      2023-03-20 [1] RSPM (R 4.2.0)
##  tidyr       * 1.3.0      2023-01-24 [1] RSPM (R 4.2.0)
##  tidyselect    1.2.0      2022-10-10 [1] RSPM (R 4.2.0)
##  tidyverse   * 2.0.0      2023-02-22 [1] RSPM (R 4.2.0)
##  timechange    0.2.0      2023-01-11 [1] RSPM (R 4.2.0)
##  tweenr        2.0.2      2022-09-06 [1] RSPM (R 4.2.0)
##  tzdb          0.3.0      2022-03-28 [1] RSPM (R 4.2.0)
##  utf8          1.2.3      2023-01-31 [1] RSPM (R 4.2.0)
##  vctrs         0.6.2      2023-04-19 [1] RSPM (R 4.2.0)
##  viridis       0.6.3      2023-05-03 [1] RSPM (R 4.2.0)
##  viridisLite   0.4.2      2023-05-02 [1] RSPM (R 4.2.0)
##  vroom         1.6.3      2023-04-28 [1] RSPM (R 4.2.0)
##  webshot       0.5.4      2022-09-26 [1] RSPM (R 4.2.0)
##  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
##  xfun          0.39       2023-04-20 [1] RSPM (R 4.2.0)
##  xml2          1.3.4      2023-04-27 [1] RSPM (R 4.2.0)
##  yaml          2.3.7      2023-01-23 [1] RSPM (R 4.2.0)
## 
##  [1] /usr/local/lib/R/site-library
##  [2] /usr/local/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

8 References

[1]
Garcı́a-Ruiz S, Zhang D, Gustavsson EK, Rocamora-Perez G, Grant-Peters M, Fairbrother-Browne A, et al. Splicing accuracy varies across human introns, tissues and age 2023. https://doi.org/10.1101/2023.03.29.534370.
[2]
Hummel M, Edelmann D, Kopp-Schneider A. Clustering of samples and variables with mixed-type data. PLOS ONE 2017;12:e0188274. https://doi.org/10.1371/journal.pone.0188274.